It goes without saying that divorce and the stories behind it have become hot topics in contemporary society. One can hardly go anywhere without hearing about the shockingly high divorce rates constantly reported by the press and media … In this script, we focus on a few general factors that are associated with the end of marriages.
First, we looked at how divorce rates differ across states. Then we examined, one at a time, the relationships between the divorce rate and the factors of interest: education level, income level, race, occupation, and working hours per week. Following the thread of curiosity, we also took a closer look at several specific questions, such as which race shows a relatively higher divorce rate, whether people working in the financial industry are more prone to divorce, and how financial conditions and working time affect the quality of marriages. Finally, we compared the two states with the highest and lowest divorce rates in light of our earlier results.
library(data.table)
library(dplyr)
varToKeep <- c("PWGTP","ST","SCHL","AGEP","SEX","MSP","WAGP","CIT","COW","WKHP","RAC1P","JWMNP","MARHT","MARHYP","RETP","ANC","DIS","ESR","FOD1P","NATIVITY","OC","OCCP","PERNP","MAR","POVPIP","QTRBIR")
data1 <- fread('ss13pusa.csv',select = varToKeep)
data2 <- fread('ss13pusb.csv',select = varToKeep)
dataAll <- rbind(data1,data2)
# recode marital status (MAR) as a factor with readable labels
dataAll$MAR <- factor(dataAll$MAR)
marryStatus <- c("Married","Widowed","Divorced","Separated","Never married or under 15 years old")
levels(dataAll$MAR) <- marryStatus
Our group chose the 2013 personal (person-level) data. So, unlike the divorce-to-marriage ratio, which is the number of divorces divided by the number of marriages in a given year, we define the divorce rate in this report as the number of people who have ever been divorced divided by the number of people who have ever been married. This definition follows from the features of this particular personal data.
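The definition above can be illustrated on a toy table. This is only a minimal sketch: the `toy` data frame is made up, and treating "currently divorced" as "ever divorced" is a simplifying assumption, since the code that builds the report's own divorced subset is not shown.

```r
# Toy person-level records; PWGTP is the person weight, MAR the marital status.
toy <- data.frame(
  ST    = c("NY", "NY", "NY", "NV", "NV", "NV"),
  MAR   = c("Married", "Divorced", "Never married", "Divorced", "Married", "Widowed"),
  PWGTP = c(100, 50, 80, 60, 40, 20)
)

# Ever married: everyone except the never-married group.
ever_married <- toy[toy$MAR != "Never married", ]

# Assumption: only people currently recorded as divorced count as "ever divorced".
ever_divorced <- ever_married[ever_married$MAR == "Divorced", ]

# Weighted head counts per state.
married_w  <- aggregate(PWGTP ~ ST, data = ever_married,  FUN = sum)
divorced_w <- aggregate(PWGTP ~ ST, data = ever_divorced, FUN = sum)

# Merge by state so the ratio never depends on row order.
rates <- merge(married_w, divorced_w, by = "ST", suffixes = c(".married", ".divorced"))
rates$rate <- rates$PWGTP.divorced / rates$PWGTP.married
print(rates)
```

Merging on the state key before dividing avoids relying on two tables happening to share the same row order.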
#calculate the divorce rate by state
# dplyr joins take `by`, not merge()'s `by.x`; this assumes statenames also has an ST column
marriedData_divorced<-right_join(marriedData_divorced,statenames,by='ST')
marriedData_divorced<-mutate(marriedData_divorced,value=round(marriedData_divorced$PWGTP/marriedData_total$PWGTP,2))
marriedData_for_draw<-marriedData_divorced[,c(1,3)]
colnames(marriedData_for_draw)<-c('region','value')
#draw the plot
library(choroplethrMaps)
library(choroplethr)
state_choropleth(marriedData_for_draw,title='Divorce Rate by State',legend='rate',num_colors = 5)
As we can see, divorce rates vary considerably across states: some states, like NY and CA, have low divorce rates, while others, like NV or OK, have high ones. We would like to find out which variables influence the divorce rate.
By grouping the education levels recorded in this data into 8 subgroups, from having attended “Elementary School” to holding a “Doctorate Degree”, we hope to investigate in detail how education level relates to divorce rates in the States.
EduLevel=c("Elementary School", "Middle School","High School","Some Degree","Bachelor's Degree",
"Master's Degree","Professional Degree","Doctorate Degree")
Edu_DivRate=c(ElemSchl_DivRate,MiddleSchl_DivRate,HighSchl_DivRate,SomeDegree_DivRate,Bach_DivRate,
Master_DivRate,Pro_DivRate,Doc_DivRate)
# convert the rates to percentages rounded to 2 decimals to make the graph look nicer
Edu_DivRate_2d=round(100*Edu_DivRate, digits=2)
# plot the graphs
library(highcharter)
highchart() %>%
hc_chart(margin=130,height=600)%>%
hc_chart(type = "pyramid")%>%
hc_add_series(
name ="Education Level Percentages",
data = list_parse(
data.frame(name = EduLevel,
y = Edu_DivRate_2d)))
With respect to education levels, we chose the “Pyramid” chart to illustrate the gradual learning process of a person. In the graphic above, the height of each trapezoid or triangle represents the average divorce rate of that subgroup: the taller the section, the higher its calculated divorce rate. For instance, people with “Some Degree” (between high school and a bachelor’s degree) have the highest divorce rate, at 42%.
Overall, our results are consistent with the common belief that well-educated people tend to stay married. Surprisingly, however, people with doctorate degrees are slightly more likely to get divorced than those with master’s or bachelor’s degrees.
We then calculate the divorce rate for each of the 9 race categories.
highchart() %>%
hc_title(text = "Divorce rate by race") %>%
hc_add_series_labels_values(dataRACE$RACE, round(dataDivorcedRACE[,2] / dataRACE[,2]*100,2), name = "Divorce rate",
colorByPoint = TRUE, type = "column") %>%
# inset pie: each race's share of the population in dataRACE, in percent to match the tooltip
hc_add_series_labels_values(dataRACE$RACE, round(100*dataRACE$PWGTP/sum(dataRACE$PWGTP),2),
type = "pie",
name = "Population share", colorByPoint = TRUE, center = c('85%', '10%'),
size = 100, dataLabels = list(enabled = TRUE)) %>%
hc_yAxis(title = list(text = "Divorce Rate"),
labels = list(format = "{value}%"), max = 100) %>%
hc_xAxis(categories = dataRACE$RACE) %>%
hc_legend(enabled = FALSE) %>%
hc_tooltip(pointFormat = "{point.y}%")
As the plot above shows, Asian people have the lowest divorce rate, while Black or African American and American Indian or Alaska Native people show higher divorce rates than the other groups. The differences in divorce rate across races might be caused by cultural differences.
We grouped occupations into 25 industries to see how divorce rates differ across industries.
# draw bar plot(interactive)
library(plotly)
library(reshape2)
p <- ggplot(DF.OCC, aes(x = Industry, y = DivorceRate)) +
geom_bar(stat = "identity",fill = "steelblue")+labs(title="Divorce Rate by Industry")+
ylab("Divorce rate") + theme_minimal()
ggplotly(p)
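##### weighted number of people in marriedData by industry and sex (the base for the rates by gender)
dataOCC1 <- aggregate(marriedData$PWGTP, by = list(marriedData$OCCP1,marriedData$SEX),FUN = sum)
# aggregate() lists all rows for SEX = 1 (men) first, then SEX = 2 (women);
# the fixed row ranges assume each of the 25 industries appears for both sexes
dataM.OCC1 <- dataOCC1[1:25,c(1,3)]
dataM.OCC2 <- dataOCC1[26:50,3]
dataM.OCCnew <- cbind(dataM.OCC1,dataM.OCC2)
names(dataM.OCCnew) <- c('OCCP','PWGTP.married.man','PWGTP.married.woman')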
It is said that men working in the computer science industry are more reliable than those working in the financial industry. But does this hold when we look at divorce rates?
[Figure: divorce rate for men and women by industry; the original page offered an interactive graph with more details.]
In fact, people who work in the financial and computer science industries both have low divorce rates. If we consider only the probability of getting divorced, marrying a banker and marrying a software engineer are almost the same.
According to the plots above, people who work in CMM (computer science), EDU (education), ENG (engineering), FFF (farming, fishing and forestry), MIL (military) and SCI (science) have relatively lower divorce rates than people in other industries. Overall, the divorce rates for women are higher than those for men, and women in the entertainment industry have the highest divorce rate, at 58.6%.
We first check the relationship between income over the last 12 months and the divorce rate, and the relationship between working hours per week and the divorce rate; then we combine the two variables. Specifically, we divided income and working hours into 11 and 10 categories, respectively.
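Binning of this kind can be done with base R's `cut()`. The sketch below is illustrative only: the sample values and the wage break points are assumptions, because the report's actual cut-offs are not shown; the hour labels simply mirror the 5, 15, ..., 95 values used in the plotting code.

```r
# Hypothetical weekly hours and yearly wages for six people.
hours <- c(8, 38, 40, 45, 62, 99)
wages <- c(5000, 32000, 58000, 75000, 120000, 260000)

# Hours: ten bins of width 10, labelled by a representative value (5, 15, ..., 95).
hour_bin <- cut(hours, breaks = seq(0, 100, by = 10), labels = seq(5, 95, by = 10))

# Wages: illustrative break points only, not the report's real categories.
wage_breaks <- c(0, 10000, 25000, 50000, 100000, 200000, Inf)
wage_bin <- cut(wages, breaks = wage_breaks, include.lowest = TRUE, dig.lab = 6)

# Number of people falling into each category.
print(table(hour_bin))
print(table(wage_bin))
```

Because `cut()` uses right-closed intervals by default, a value of exactly 40 hours lands in the (30, 40] bin.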
#calculate the weighted population of divorced people by salary level
dataDivorcedSalary <- aggregate(everDivorcedData$PWGTP,by=list(everDivorcedData$WAGP2),FUN=sum)
names(dataDivorcedSalary)<-c('WAGP','PWGTP')
# both tables stay data frames (no as.vector()) so the [,2] column indexing below keeps working
income_divorced_rate_frame=data.frame(rate=dataDivorcedSalary[,2] / dataSalary[,2],
income=sort(unique(marriedData$WAGP2)))
# get the divorce rate by hours worked per week
hour_divorced_rate_frame=data.frame(rate=dataDivorcedWKHP[,2] / dataWKHP[,2],
hour=c(0,5,15,25,35,45,55,65,75,85,95))
#income against divorce rate
library(ggplot2)
income_divorced_rate_plot <- ggplot(data=income_divorced_rate_frame, aes(x = income, y = rate))
income_divorced_rate_plot + geom_point(colour = "red", size = 1.5) + ggtitle("Divorce Rate by Income") +
labs(x="Income",y="Divorce Rate")
#work hours per week against divorce rate, with a quadratic trend line
hour_divorced_rate_plot <- ggplot(data=hour_divorced_rate_frame, aes(x = hour, y = rate))
hour_divorced_rate_plot + geom_point(colour = "red", size = 1.5) + ggtitle("Divorce Rate by Work Hours") +
# span only applies to loess smoothers, so it is dropped for the lm fit
geom_smooth(method = "lm",formula = y ~ poly(x, 2), se = FALSE) +
labs(x="Hours Per Week",y="Divorce Rate")
As we can see, if we look only at income, the more a person earns, the lower his or her divorce rate; if we look only at working hours, both very short and very long working weeks are associated with higher divorce rates.
library(d3heatmap)
#create a matrix so that we can load the data for plotting heatmap
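heatmap_matrix<-matrix(data=NA,nrow=nrow(dataWKHP)-1,ncol=nrow(dataSalary))
n_hours<-nrow(heatmap_matrix) # hour categories per income level (10 here)
for (i in 1:ncol(heatmap_matrix)){
for(j in 1:n_hours){
# in both aggregated tables the hour category varies fastest within each income level
k<-n_hours*(i-1)+j
heatmap_matrix[j,i]=dataDivorced_WAGP_WKHP[k,3]/data_WAGP_WKHP[k,3]
}
}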
#change column and row names
dimnames(heatmap_matrix) = list(
sort(unique(dataAll$WKHP))[-1], # row names
sort(unique(dataAll$WAGP2))) # column names
d3heatmap(heatmap_matrix,Rowv=FALSE,Colv=FALSE,colors='Reds')
The heatmap suggests that, when income and working hours are considered together, a competitive income combined with moderate working hours is associated with the lowest divorce rate.
After studying all the factors of interest, let’s look back at the first map.
We picked New York and Nevada, the states with the lowest and highest divorce rates, respectively.
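A rate matrix like the one behind the heatmap can also be built without index arithmetic. The following is a minimal sketch on made-up data; the column names `hour`, `income` and `divorced` are hypothetical and do not come from the report's tables.

```r
# Toy records with pre-binned hours and income, plus a divorce flag.
toy <- data.frame(
  hour     = factor(c(35, 35, 45, 45, 35, 45, 35, 45)),
  income   = factor(c(20000, 20000, 20000, 20000, 60000, 60000, 60000, 60000)),
  divorced = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  PWGTP    = c(10, 30, 20, 20, 40, 25, 10, 25)
)

# Weighted counts per (hour, income) cell: everyone, then the divorced subset.
# Factors keep all levels, so both tables have the same shape.
total_w    <- xtabs(PWGTP ~ hour + income, data = toy)
divorced_w <- xtabs(PWGTP ~ hour + income, data = toy[toy$divorced, ])

# Element-wise ratio gives the divorce rate of every cell, with dimnames attached.
rate_matrix <- unclass(divorced_w / total_w)
print(round(rate_matrix, 2))
```

The resulting matrix carries its row and column names, so it could be passed to a heatmap function without a separate `dimnames` step.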
# population-weighted mean income of the people in marriedData for one state
weighted_income <- function(state) {
d <- marriedData[marriedData$ST==state,]
income <- as.numeric(levels(d$WAGP2))[d$WAGP2] # factor labels back to numbers
sum(d$PWGTP*income)/sum(d$PWGTP)
}
income_NY <- weighted_income('new york')
income_NV <- weighted_income('nevada')
We built a scoring system for 5 variables: income, working hours, education level, occupation and race. We use the reciprocal of the divorce rate as the weight for each category of every variable, so that the overall weighted scores are on the same scale. In the “spiderweb” chart, we did not try to render a quantitative result such as an exact divorce rate; we only wanted to show a qualitative comparison of the two states. In this way, we can see the main differences between the two states with regard to the 5 variables.
The comparison suggests that the difference in divorce rates between New York and Nevada is mainly driven by differences in the income levels and occupation distributions of the two states.
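One plausible reading of the scoring idea, for a single variable, is sketched below. All numbers are hypothetical, and combining the reciprocal weights with each state's category shares is an assumption about how the scores were formed, not the report's confirmed method.

```r
# Hypothetical overall divorce rates for three income categories.
rate <- c(low = 0.40, mid = 0.30, high = 0.20)

# Hypothetical population shares of those categories in two states.
share <- rbind(
  state_A = c(low = 0.2, mid = 0.5, high = 0.3),
  state_B = c(low = 0.5, mid = 0.4, high = 0.1)
)

# Weight each category by the reciprocal of its divorce rate, then sum per state.
weights <- 1 / rate
score <- as.vector(share %*% weights)
names(score) <- rownames(share)
print(round(score, 3))
```

Under this reading, a higher score means a state's population is concentrated in categories with lower divorce rates; repeating the calculation for each of the 5 variables would give the axes of a spiderweb chart.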
We can see that even though the overall divorce rate is high, it differs across aspects such as education level, race, industry, working hours and income. No matter who you are, if you want a stable marriage, it helps to have a bachelor’s degree (maybe higher), work in an industry like computer science, education or engineering, work a moderate number of hours per week (40-60 hours) and have a high income. In a word, work and study hard! And don’t forget to spend time with your partner!
1. The features we picked are based on our interests, but are they really important reasons for the end of a marriage? Further testing may be needed.
2. The scoring system can be further improved by optimizing the weights or the model.
3. If we had time, we could build a Shiny app that automates the process of calculating divorce rates and plotting graphs. It could also show a person’s probability of getting divorced based on the information entered.